Atomically moving list elements between lists using read-copy update

ABSTRACT

A system, method and computer program product for atomically moving a shared list element from a first list location to a second list location includes inserting a placeholder element at the second list location to signify to readers that a move operation is underway, removing the shared list element from the first list location, re-identifying the list element to reflect its move from the first list location to the second list location, inserting it at the second list location and unlinking the placeholder element. A deferred removal of the placeholder element is performed following a period in which readers can no longer maintain references thereto. A method, system and computer program product are additionally provided for performing a lookup of a target list element that is subject to being atomically moved from a first list to a second list.

BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention relates to computer systems and methods in which list data are shared by software running concurrently on one or more processors. More particularly, the invention concerns an improved system and method that allows lock-free lookups of list elements while efficiently permitting concurrent update operations in which list elements are moved from one list to another.

2. Description of the Prior Art

By way of background, shared data elements that are members of a linked list sometimes need to be moved from one list to another while maintaining consistency for the benefit of data consumers who may be concurrently performing lookups on the same data. This situation arises in the context of in-memory file system tree images used by operating systems to perform file name lookups for locating files maintained on block storage devices. When a file's name is changed and/or the file is moved from one directory to another (referred to as a “rename” operation), its corresponding entry in the file system tree image will often move between lists. For example, in a typical directory entry cache, directory entry elements (representing files) are assigned to doubly-linked circular directory lists. Each such list is headed by a parent directory entry whose files are represented by the directory entries in the list. Relocating a file from one directory to another will cause its directory entry to move from one directory list to another. Similarly, in a directory entry hash table, directory entries are assigned to hash chains (lists) according to a hash algorithm based on their name and the name of their parent directory. Directory entries will typically move from one hash chain to another whenever the file's name is changed or it is relocated to another directory.

Techniques must be used to perform these list operations without impacting readers who may be concurrently performing lookups on the same file. Moreover, in computing environments conforming to POSIX (Portable Operating System Interface), the list manipulations must be performed atomically. This atomicity requirement is illustrated in the context of the POSIX rename( ) system call by considering the situation where the rename( ) operation races with concurrent lookups of the old file name and the new file name. If a lookup of the new name succeeds, then every subsequent lookup of the old name must fail. Similarly, if a lookup of the old name fails, then every subsequent lookup of the new name must succeed. Note that a “subsequent” lookup must start after a preceding lookup completes. This is summarized in the following table, in which the term “failed” signifies a failure to open the file being renamed:

TABLE 1: POSIX rename( ) atomicity conditions

rename(“old”, “new”) | If open(“old”) failed, then open(“new”) must succeed. | If open(“new”) succeeded, then open(“old”) must fail.
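The two conditions of Table 1 can be expressed as a small checker over a history of completed lookups. This is only an illustrative sketch: the function name and the representation of a history as (name, found) pairs, listed in the order in which non-overlapping lookups completed, are assumptions made for the example and do not appear in the specification.

```python
def violates_rename_atomicity(observations):
    """Return True if a history of lookups breaks a Table 1 condition.

    `observations` is a list of (name, found) pairs, where name is "old"
    or "new" and each lookup starts after the previous one completes.
    """
    old_failed = False      # some earlier lookup of the old name failed
    new_succeeded = False   # some earlier lookup of the new name succeeded
    for name, found in observations:
        if name == "old":
            if found and new_succeeded:
                return True   # "new" succeeded, so "old" had to fail
            if not found:
                old_failed = True
        else:
            if not found and old_failed:
                return True   # "old" failed, so "new" had to succeed
            if found:
                new_succeeded = True
    return False
```

A history in which the old name fails and the new name then also fails is reported as a violation, as is one in which the new name succeeds and the old name is still found afterwards.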

The atomicity requirements for the POSIX rename( ) system call are the same whether a file is being renamed to a new name or is being renamed on top of a pre-existing file. In the latter case, an “early” attempt to open the new filename (i.e., before the rename( ) operation returns) will fail to open the renamed file, but will instead open the pre-existing file. This race condition is in all ways equivalent to the race condition where the file is being renamed to a new name. Therefore, for simplicity, the ensuing discussion will consider only the case where a file is renamed to a new name.

There are a number of prior-art algorithms that permit atomic rename( ) operations by relying on locks held during the lookup operations. This is undesirable because directory cache lookups are extremely common, and such operations should be lock-free if possible. There are also lock-free synchronization techniques that provide the desired semantics, and avoid locking in the lookups. However, these rename( ) operations are extremely costly, requiring duplication of the entire data structure (which for a hash table can contain hundreds of thousands of elements, even on small desktop systems). Furthermore, even though the lookups are lock-free, they use atomic operations that perform write operations, thereby inflicting costly cache misses on lookups running in other processors.

Another mutual exclusion technique, known as read-copy update, permits shared data to be accessed for reading without the use of locks, writes to shared memory, memory barriers, atomic instructions, or other computationally expensive synchronization mechanisms, while still permitting the data to be updated concurrently. The technique is well suited to multiprocessor computing environments in which the number of read operations (readers) accessing a shared data set is large in comparison to the number of update operations (updaters), and wherein the overhead cost of employing other mutual exclusion techniques (such as locks) for each read operation would be high.

The read-copy update technique implements data updates in two phases. In the first (initial update) phase, the actual data update is carried out in a manner that temporarily preserves two views of the data being updated. One view is the old (pre-update) data state that is maintained for the benefit of operations that may be currently referencing the data. The other view is the new (post-update) data state that is available for the benefit of operations that access the data following the update. In the second (deferred update) phase, the old data state is removed following a “grace period” that is long enough to ensure that all executing operations will no longer maintain references to the pre-update data.

Traditional read-copy update manipulation of list data leaves the old data element in place in the list, creates a new copy with the desired modifications, and then atomically inserts the new copy in place of the old element into the same list. This is impractical for the POSIX rename( ) operation. Here, the old element must be atomically removed and a new element inserted, not necessarily in the same place that the old one occupied, but likely into a different list. File system operations further complicate traditional read-copy update due to the existence of long-lived references to the old list element (the directory entry representing the file) that is to be removed following a grace period. It is often difficult or even infeasible to determine where these references are located, because many different parts of an operating system kernel or of dynamically loaded kernel modules might at any time acquire a reference to the list element. Thus, there is no effective method for tracking down all the possible references to the old element.

A possible work-around would be to have read-copy update atomically update an entire file system tree data structure, and atomically replace it with a new one by switching pointers. However, as in the case of lock-free synchronization, this latter approach is hopelessly inefficient for directories containing large numbers of files, and is even less well suited to systems that maintain a hash table to cache filename/directory mappings. As stated, it is not unusual for even small desktop machines to cache more than 100,000 such mappings. Making a new duplicate copy of this table for each rename( ) operation is clearly undesirable. Another alternative, creating a copy of a single hash chain, is not feasible because the rename( ) operation will normally move a directory entry to some other hash chain. It is also not possible to atomically create a copy of only the affected pair of hash chains with the instructions available on commodity microprocessors.

In sum, given current commodity microprocessor instruction sets, along with the undesirability of duplicating large list structures, it is not practical to atomically move an element from one list to another using traditional read-copy update techniques. If the POSIX rename( ) operation is not performed atomically, there will be a short but non-zero duration when the renamed directory entry will not be on any list. This time duration can be expanded by interrupts, ECC (Error Correction Code) errors in memory or caches, or by many other events that can occur in current microprocessors and operating systems. In a multiprocessor system, it is possible that some other process might be able to perform a lookup on the new name followed by the old name during this time interval and observe both failing, thus violating the required POSIX semantics as shown in the second column of Table 1.

Accordingly, a need exists for an efficient lock-free technique for atomically moving shared list elements from one list to another. It would be particularly desirable to provide a solution to the foregoing problem using existing aspects of the conventional read-copy update technique but with modifications thereto to facilitate inter-list movement of list elements with the required atomicity.

SUMMARY OF THE INVENTION

The foregoing problems are solved and an advance in the art is obtained by a method, system and computer program product for atomically moving a shared list element from a first list location to a second list location while permitting lock-free concurrent lookup operations. To perform the atomic move operation, a placeholder element is inserted at the second list location to signify to readers that a move operation is underway, and the shared list element is removed from the first list location. The shared list element is then re-identified to reflect its move from the first list location to the second list location. It is inserted at the second list location and the placeholder element is unlinked. A deferred removal of the placeholder element is performed following a period in which readers can no longer maintain references to the placeholder element. Readers that were waiting on the placeholder element will fail and presumably be retried, at which point the shared list element will be found at its new location.

In exemplary embodiments of the invention, the placeholder element includes a flag that indicates when the move operation has completed, and readers performing lookups use a mechanism, such as a semaphore, an event queue, or a wait queue, to wait until the flag signifies completion before returning. The placeholder element further includes a reference count representing a count of readers maintaining references to the placeholder element (and thus waiting for completion of the move operation). This reference count is used in conjunction with read-copy update to defer release of the placeholder element until all readers have completed processing thereof.

The shared list element is not limited to any particular type of data, but one of its uses is in a doubly-linked circular list of directory entries in a file system directory entry cache. The shared list element could also be a member of a directory entry hash table chain. In both cases, the move operation can be part of a file rename( ) operation. A further example of the shared list element would be an individual row element in a relational database tuple. Many other list environments would likewise be candidates for implementation of the present invention.

A method, system and computer program product are additionally provided for performing a lookup of a shared list element that is subject to being atomically moved from a first list to a second list. The lookup initiates a list traversal beginning at a first list element. Upon encountering a list element that is the target of the lookup, the lookup returns success. Upon encountering a list element that is a placeholder for the lookup target that was generated as a result of a concurrent move operation involving the target, the lookup waits until the placeholder indicates that the move operation has completed. When this occurs, the lookup returns failure so that the lookup can be retried. Upon the target list element or the placeholder not being found in the list, the lookup returns failure.

In exemplary embodiments of the invention, the lookup further includes maintaining a count of elements traversed by the lookup and asserting a lock against concurrent move operations if the count reaches a computed maximum. The lookup may additionally include determining whether the lookup has been pulled from one list to another as a result of a concurrent move, and if true, returning to the initial list being traversed. The lookup increments a reference count in the placeholder upon encountering the placeholder and decrements the reference count if the placeholder indicates that the concurrent move operation has completed. When waiting on the placeholder, the lookup can block on a global or per-element semaphore, or spin (busy wait) on a global or per-element lock.

BRIEF DESCRIPTION OF THE DRAWINGS

The foregoing and other features and advantages of the invention will be apparent from the following more particular description of exemplary embodiments of the invention, as illustrated in the accompanying Drawings, in which:

FIGS. 1A-1D are diagrammatic representations of a linked list of data elements undergoing a data element replacement according to a conventional read-copy update mechanism;

FIGS. 2A-2C are diagrammatic representations of a linked list of data elements undergoing a data element deletion according to a conventional read-copy update mechanism;

FIG. 3 is a flow diagram illustrating a grace period in which four processes pass through a quiescent state;

FIG. 4 is a functional block diagram showing a multiprocessor computing system that represents one exemplary environment in which the present invention can be implemented;

FIG. 5 is a functional block diagram showing a read-copy update subsystem implemented by each processor in the multiprocessor computer system of FIG. 4;

FIGS. 6A-6F are diagrammatic representations showing the state of two lists as a list element on one list is moved to the other list in accordance with the invention;

FIG. 7 is a flow diagram showing exemplary steps that may be used by an updater to move a list element between lists in accordance with the invention;

FIGS. 8A, 8B and 8C collectively illustrate a flow diagram showing exemplary steps that may be used by a reader to perform a lookup on a list element in accordance with the invention;

FIG. 9 is a diagrammatic representation of a birthstone element that may be used in accordance with an exemplary embodiment of the present invention that performs a POSIX rename( ) operation on a list of directory entry elements; and

FIG. 10 is a diagrammatic illustration of storage media that can be used to store a computer program product for implementing read-copy update grace period detection functions in accordance with the invention.

DETAILED DESCRIPTION OF EXEMPLARY EMBODIMENTS

Before discussing the details of the invention in its exemplary embodiments, it will be helpful to consider several examples illustrating the manner in which conventional read-copy update can be used to update list elements. FIGS. 1A-1D illustrate one such situation wherein a data element B in a group of data elements A, B and C is to be modified. The data elements A, B, and C are arranged in a singly-linked list that is traversed in acyclic fashion, with each element containing a pointer to a next element in the list (or a NULL pointer for the last element) in addition to storing some item of data. A global pointer (not shown) is assumed to point to data element A, the first member of the list.

It is assumed that the data element list of FIGS. 1A-1D is traversed (without locking) by multiple concurrent readers and occasionally updated by updaters that delete, insert or modify data elements in the list. In FIG. 1A, the data element B is being referenced by a reader r1, as shown by the vertical arrow below the data element. In FIG. 1B, an updater u1 wishes to update the linked list by modifying data element B. Instead of simply updating this data element without regard to the fact that r1 is referencing it (which might crash r1), u1 preserves B while generating an updated version thereof (shown in FIG. 1C as data element B′) and inserting it into the linked list. This is done by u1 acquiring a spinlock, allocating new memory for B′, copying the contents of B to B′, modifying B′ as needed, updating the pointer from A to B so that it points to B′, and releasing the spinlock. All subsequent (post update) readers that traverse the linked list, such as the reader r2, will thus see the effect of the update operation by encountering B′. On the other hand, the old reader r1 will be unaffected because the original version of B and its pointer to C are retained. Although r1 will now be reading stale data, there are many cases where this can be tolerated, such as when data elements track the state of components external to the computer system (e.g., network connectivity) and must tolerate old data because of communication delays.

At some subsequent time following the update, r1 will have continued its traversal of the linked list and moved its reference off of B. In addition, there will be a time at which no other reader process is entitled to access B. It is at this point, representing expiration of a grace period, that u1 can free B, as shown in FIG. 1D.
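The replacement sequence of FIGS. 1A-1D can be modeled in a few lines. The following is a single-threaded sketch only: the Node class and rcu_replace function are names invented for the example, the updater's spinlock is omitted, and Python object references stand in for the pointers of the figures.

```python
class Node:
    """Singly-linked list element holding one item of data."""
    def __init__(self, key, data, next=None):
        self.key = key
        self.data = data
        self.next = next

def rcu_replace(prev, old, new_data):
    """Publish an updated copy of `old` after `prev`.

    Returns the old node, which must not be freed until a grace
    period has elapsed because readers may still reference it.
    """
    new = Node(old.key, new_data, old.next)  # copy B to B', then modify B'
    prev.next = new                          # one pointer store publishes B'
    return old                               # B and its pointer to C are retained
```

A reader that already holds a reference to B can still follow B's pointer to C, while any reader starting from A after the store sees B′.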

FIGS. 2A-2C illustrate the use of read-copy update to delete a data element B in a singly-linked list of data elements A, B and C. As shown in FIG. 2A, a reader r1 is assumed to be currently referencing B and an updater u1 wishes to delete B. As shown in FIG. 2B, the updater u1 updates the pointer from A to B so that A now points to C. In this way, r1 is not disturbed but a subsequent reader r2 sees the effect of the deletion. As shown in FIG. 2C, r1 will subsequently move its reference off of B, allowing B to be freed following expiration of a grace period.
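The deletion of FIGS. 2A-2C reduces to a single pointer update. In this sketch the Node class and rcu_delete function are hypothetical names, and the model is single-threaded.

```python
class Node:
    """Singly-linked list element."""
    def __init__(self, key, next=None):
        self.key = key
        self.next = next

def rcu_delete(prev, victim):
    """Unlink `victim` so that new readers bypass it.

    The victim keeps its own next pointer, so a reader currently
    referencing it can still continue on to the following element.
    """
    prev.next = victim.next
    return victim   # to be freed only after a grace period
```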

In the context of traditional read-copy update, a grace period represents the point at which all running processes having access to a data element guarded by read-copy update have passed through a “quiescent state” in which they can no longer maintain references to the data element, assert locks thereon, or make any assumptions about data element state. For many types of shared data, a context (process) switch, an idle loop, and user mode execution all represent quiescent states for any given CPU (as can other operations that will not be listed here).

In FIG. 3, four processes 0, 1, 2, and 3 running on four separate CPUs are shown to pass periodically through quiescent states (represented by the double vertical bars). The grace period (shown by the dotted vertical lines) encompasses the time frame in which all four processes have passed through one quiescent state. If the four processes 0, 1, 2, and 3 were reader processes traversing the linked lists of FIGS. 1A-1D or FIGS. 2A-2C, none of these processes having reference to the old data element B prior to the grace period could maintain a reference thereto following the grace period. All post grace period searches conducted by these processes would bypass B by following the links inserted by the updater.
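The grace period of FIG. 3 can be pictured as a set of processes that have yet to report a quiescent state. The GracePeriod class below is an illustrative assumption, not a description of any particular detection mechanism.

```python
class GracePeriod:
    """Tracks which processes have yet to pass through a quiescent state."""
    def __init__(self, processes):
        self.pending = set(processes)

    def quiescent_state(self, process):
        """Record that `process` has passed through a quiescent state."""
        self.pending.discard(process)

    def expired(self):
        """True once every tracked process has been quiescent at least once."""
        return not self.pending
```

With processes 0 through 3, the grace period expires only after the last of the four reports a quiescent state; repeated reports from the same process have no further effect.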

There are various methods that may be used to implement a deferred data update following a grace period, including but not limited to the use of callback processing as described in commonly assigned U.S. Pat. No. 5,727,209, entitled “Apparatus And Method For Achieving Reduced Overhead Mutual-Exclusion And Maintaining Coherency In A Multiprocessor System Utilizing Execution History And Thread Monitoring.” The contents of U.S. Pat. No. 5,727,209 are hereby incorporated herein by this reference.

The callback processing technique contemplates that an updater of a shared data element will perform the initial (first phase) data update operation that creates the new view of the data being updated, and then specify a callback function for performing the deferred (second phase) data update operation that removes the old view of the data being updated. The updater will register the callback function (hereinafter referred to as a callback) with a read-copy update subsystem so that it can be executed at the end of the grace period. The read-copy update subsystem keeps track of pending callbacks for each processor and monitors per-processor quiescent state activity in order to detect when a current grace period has expired. When it does, all scheduled callbacks that are ripe for processing are executed.
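A minimal model of callback registration and processing might look as follows. The RcuSubsystem class and its method names are assumptions made for illustration. To keep the sketch short, each callback tracks its own set of processors that still owe a quiescent state, which is a conservative simplification of batching callbacks by grace period.

```python
class RcuSubsystem:
    """Toy read-copy update subsystem: deferred callbacks plus quiescent states."""
    def __init__(self, cpus):
        self.cpus = frozenset(cpus)
        self.waiting = []   # entries of [cpus still to be quiescent, func, arg]

    def register_callback(self, func, arg):
        """Schedule func(arg) to run once every CPU has been quiescent."""
        self.waiting.append([set(self.cpus), func, arg])

    def note_quiescent_state(self, cpu):
        """Record a quiescent state on `cpu` and run callbacks that are ripe."""
        ripe, still_waiting = [], []
        for entry in self.waiting:
            entry[0].discard(cpu)
            (still_waiting if entry[0] else ripe).append(entry)
        self.waiting = still_waiting
        for _, func, arg in ripe:
            func(arg)
```

A callback registered after some processors have already been quiescent still waits for a full pass over all processors, so it never runs early.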

The present invention represents an extension of the read-copy update mutual exclusion technique wherein, instead of replacing an old list element with a new one, the old element is atomically moved to a new list location so that the references to this element need not be changed. The invention achieves this effect by inserting a temporary placeholder element, referred to as a “birthstone,” at the destination location where the real list element is to be moved. Lookups finding the birthstone will wait until the move operation is complete before returning failure. The birthstone element is maintained until it is replaced by the actual element being moved. At this point, the birthstone is marked “complete.” Readers waiting on the birthstone “complete” state will then fail but a retry of the lookup will be attempted and the actual element will be successfully found at its new location. It is thus guaranteed that any lookup that fails to find the old element will subsequently find the new one, consistent with POSIX requirements, after the element is inserted into its new location (e.g., a new directory list or a new hash chain, depending on the implementation). As described in more detail below, reference counts and read-copy update are used to guarantee that concurrent lookups see valid state information at all points.
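The wait, fail and retry behavior described above can be sketched with Python threads. All names here are hypothetical. A threading.Event stands in for the “complete” flag together with whatever wait mechanism an implementation uses, a Python list stands in for the destination list, and reference counting is left out.

```python
import threading

class Entry:
    """An ordinary list element."""
    def __init__(self, name):
        self.name = name

class Birthstone:
    """Placeholder for an element that is in the middle of a move."""
    def __init__(self, name):
        self.name = name
        self.done = threading.Event()   # set when the move has completed

def lookup_once(lst, name):
    """One traversal: succeed, or fail (after waiting on any birthstone)."""
    for e in list(lst):                  # traverse a snapshot of the list
        if e.name == name:
            if isinstance(e, Birthstone):
                e.done.wait()            # wait for the move to complete
                return None              # then report failure
            return e
    return None

def lookup(lst, name, retries=1):
    """Retry a failed lookup so a completed move is observed."""
    for _ in range(retries + 1):
        found = lookup_once(lst, name)
        if found is not None:
            return found
    return None
```

A reader that reaches the birthstone blocks until the updater marks the move complete, fails, and then finds the real element on the retry. A reader that arrives after the move finds the element directly.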

Turning now to FIG. 4, an exemplary computing environment in which the present invention may be implemented is illustrated. In particular, a symmetrical multiprocessor (SMP) computing system 2 is shown in which multiple processors 4₁, 4₂ . . . 4ₙ are connected by way of a common system bus 6 to a shared memory 8. Respectively associated with each processor 4₁, 4₂ . . . 4ₙ is a conventional cache memory 10₁, 10₂ . . . 10ₙ and a cache controller 12₁, 12₂ . . . 12ₙ. A conventional memory controller 14 is associated with the shared memory 8. The computing system 2 is assumed to be under the management of a single multitasking operating system adapted for use in an SMP environment.

It is further assumed that update operations executed within processes, threads, or other execution contexts will periodically perform updates on a shared set of linked lists 16 stored in the shared memory 8. By way of example only, the shared list set 16 could be a directory entry cache or hash table and the lists thereof could contain file system directory entry elements. It will be appreciated that the invention may also be used in connection with many other types of lists. Reference numerals 18₁, 18₂ . . . 18ₙ illustrate individual data update operations (updaters) that may periodically execute on the several processors 4₁, 4₂ . . . 4ₙ. In the present case, the updates performed by the updaters 18₁, 18₂ . . . 18ₙ involve moving a list element from one list to another, such as could occur if a directory entry element is renamed and moved between lists in a directory entry cache or hash table. In that case, the renaming of an element would in many cases cause it to hash to a different hash chain. To facilitate such updates, the several processors 4₁, 4₂ . . . 4ₙ are programmed to implement a read-copy update (RCU) subsystem 20, as by periodically executing respective read-copy update instances 20₁, 20₂ . . . 20ₙ as part of their operating system functions.

The processors 4₁, 4₂ . . . 4ₙ also execute readers 22₁, 22₂ . . . 22ₙ that perform lookup operations on the shared list set 16. Each lookup operation is assumed to entail an element-by-element traversal of a linked list until an element which is the target of the lookup is found. If the shared list set 16 is a directory entry cache or hash table, the linked list being traversed will be selected according to the name and parent directory of the lookup target. Such lookup operations will typically be performed far more often than updates, thus satisfying one of the premises underlying the use of read-copy update.

As shown in FIG. 5, each of the read-copy update subsystem instances 20₁, 20₂ . . . 20ₙ includes a callback registration component 24. The callback registration component 24 serves as an API (Application Program Interface) to the read-copy update subsystem that can be called by the updaters 18₁, 18₂ . . . 18ₙ to register requests for deferred (second phase) data element updates following initial (first phase) updates performed by the updaters themselves. As is known in the art, these deferred update requests involve the removal of stale data elements, and will be handled as callbacks within the read-copy update subsystem 20. Each of the read-copy update subsystem instances 20₁, 20₂ . . . 20ₙ additionally includes a conventional read-copy update grace period detection mechanism 26, together with a callback processing mechanism 28 adapted to process callbacks registered by the updaters 18₁, 18₂ . . . 18ₙ. Note that the functions 26 and 28 can be implemented as part of a kernel scheduler, as is conventional.

Overview Of Atomic Move of List Element Between Lists Using Read-Copy Update

As mentioned above, the present invention applies read-copy update to the situation where a list element needs to be moved from one list to another. This is done by inserting a “birthstone” into the element's new location, which is later replaced with the actual element being moved. If a reader 22₁, 22₂ . . . 22ₙ performing a lookup sees the birthstone, it waits for the move operation to complete before failing. FIGS. 6A-6F and the flow diagram of FIG. 7 illustrate the basic technique for a pair of circularly linked lists L1 and L2, in which element B of list L1 is to be moved to list L2, and renamed to element N in the process. FIG. 6A shows the initial state of lists L1 and L2. To initiate the move operation, the updater 18₁, 18₂ . . . 18ₙ performing the operation implements a conventional mutual-exclusion mechanism in order to prevent a concurrent move operation involving B from destructively interfering with the move operation of this example. Locking is one mutual-exclusion mechanism that may be used. As shown in step 30 of FIG. 7, once mutual exclusion has been implemented, the updater 18₁, 18₂ . . . 18ₙ creates a birthstone for N and links it into list L2. The resulting state is shown in FIG. 6B. The birthstone for N is a list element having all the attributes of the list element N that it represents, except that it is designated (using a flag or other parameter) as a temporary placeholder so that readers 22₁, 22₂ . . . 22ₙ performing lookups on element N will know that it is a birthstone for N. Another parameter associated with the birthstone for N is a reference count showing the number of readers 22₁, 22₂ . . . 22ₙ that are currently referencing the birthstone. This reference count is initially set to 1 when the birthstone for N is created, and is thereafter incremented and decremented by readers 22₁, 22₂ . . . 22ₙ encountering the birthstone (as described in more detail below). As further described below, a reference count value of zero signifies to the read-copy update subsystem 20 that the birthstone can be safely removed. As can be seen from FIG. 6B, after the completion of step 30 of FIG. 7, a lookup traversing list L1 to locate element B will still succeed, while a lookup traversing list L2 to locate element N will find the birthstone for N. This will cause the reader 22₁, 22₂ . . . 22ₙ to wait until the move of B is complete.
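The birthstone's placeholder flag and reference count can be sketched as a small record. The field and function names below are assumptions chosen for the example, a `freed` attribute stands in for returning the element to free memory, and no atomicity is modeled.

```python
class Birthstone:
    """Temporary placeholder for an element that is being moved."""
    def __init__(self, name, parent=None):
        self.name = name            # same identity as the element it represents
        self.parent = parent
        self.is_birthstone = True   # marks this element as a placeholder
        self.complete = False       # set once the move has finished
        self.refcount = 1           # initial reference, dropped after a grace period
        self.freed = False

def hold(stone):
    """A reader encountering the birthstone takes a reference."""
    stone.refcount += 1

def put(stone):
    """Drop one reference; free the birthstone when the count reaches zero."""
    stone.refcount -= 1
    if stone.refcount == 0:
        stone.freed = True
    return stone.freed
```

Because the count starts at 1, a reader that takes and later drops its reference can never free the birthstone on its own before the initial reference is also dropped.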

In step 32 of FIG. 7, element B is unlinked from list L1, and the element preceding B is linked to the element following B. This state is shown in FIG. 6C. Any lookup of element B will now fail. Because the present example includes element B being renamed to element N, step 34 of FIG. 7 is implemented to perform an element re-identification operation that results in the element's name being changed. The re-identification operation can also be used to change other identifying information, such as a parent directory identifier if the element is a file system directory entry and the file is being moved from one directory to another. If the file is moved to a new directory but not renamed, the element re-identification operation will only change the parent directory identifier. If the file is also being renamed, both its name and parent directory identifier will change. Note that the re-identification operation should be atomic. If element B cannot be re-identified atomically (e.g., there are existing references to this element), the updater 18₁, 18₂ . . . 18ₙ must call the read-copy update subsystem 20 to track a grace period. This will guarantee that no readers 22₁, 22₂ . . . 22ₙ can be maintaining references to element B when it is re-identified. FIG. 6D shows the resultant state of element B following the completion of step 34 of FIG. 7. At this point, new lookups for element B will continue to fail, and lookups for element N will still find the birthstone for N. In accordance with POSIX requirements, any lookup for element N that succeeds will subsequently fail to find B.

It is possible as a result of step 34 that lookups for another element in the list L1, such as element J, will have been carried with element B to the list L2. That is, lookups can be “pulled” to a different list by moving a list element that is currently being consulted as part of a list traversal sequence at the time of the move. For example, if list L1 is being traversed during a lookup of element J, and the lookup is referencing element B at the same time element B is moved to list L2, the lookup for element J can be pulled to L2. This race condition can be resolved by maintaining a back pointer (not shown) from each list element to the corresponding list header element, then restarting the search if the wrong back pointer is encountered (see lookup technique below for further details). Thus, a simple check can detect and recover from this possible race.
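The back-pointer check can be illustrated with one traversal step. ListHead, Elem and step are hypothetical names, the lists are not circular in this simplified model, and the move itself is simulated by direct assignment rather than performed concurrently.

```python
class ListHead:
    """List header element."""
    def __init__(self, label):
        self.label = label
        self.first = None

class Elem:
    """List element carrying a back pointer to its list header."""
    def __init__(self, name, head):
        self.name = name
        self.head = head    # back pointer, updated when the element is moved
        self.next = None

def step(elem, start_head):
    """Advance a traversal that began on the list headed by `start_head`.

    If the element just examined now belongs to another list, the
    lookup was pulled there by a move, so the search restarts.
    """
    if elem.head is not start_head:
        return start_head.first
    return elem.next
```

If B is moved to L2 while a lookup for J is examining it, the next step returns to the start of L1 instead of following B's new link into L2.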

In step 36 of FIG. 7, the newly renamed element N is linked into list L2 and the birthstone for N is unlinked so that no new lookups will find this element. FIG. 6E shows this state. Lookups for B will continue to fail, but lookups for N now find the newly renamed element instead of its birthstone. In step 38 of FIG. 7, the birthstone for N is marked “complete,” and will thereafter be removed and returned to free memory using read-copy update and the birthstone's reference count. Read-copy update is used to defer freeing the birthstone for N until a grace period has elapsed. This ensures that any readers 22 ₁, 22 ₂ . . . 22 _(n) with newly acquired references to the birthstone that did not have time to process the birthstone by the time it is unlinked, will have passed through a quiescent state. Therefore, the updater 18 ₁, 18 ₂ . . . 18 _(n) invokes the callback registration component 24 (FIG. 5) of the read-copy update subsystem 20 to schedule a callback. The grace period detection component 26 (FIG. 5) of the read-copy update subsystem 20 waits until a grace period has expired and the callback processing component 28 (FIG. 5) performs callback processing on the birthstone for N. In this case, such callback processing comprises decrementing the birthstone's reference count and testing its value. If it is zero, all readers 22 ₁, 22 ₂ . . . 22 _(n) will have completed their processing of the birthstone (which processing will include each reader first incrementing, then decrementing the reference count). This will signify that no readers can possibly maintain a reference to the birthstone. The callback processing component 28 may then safely free up the birthstone. FIG. 6F shows this state. Alternatively, if the reference count is not zero at the end of the grace period, callback processing will be terminated without freeing the birthstone. As described below in the discussion on lookup technique, the responsibility for removing the birthstone for N will now fall on the last reader 22 ₁, 22 ₂ . . . 22 _(n) waiting for the birthstone to complete. This reader will decrement the reference count to zero, then free the birthstone.
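The division of responsibility between the grace-period callback and the last waiting reader can be sketched as follows. This is a single-threaded model under stated assumptions: the birthstone is assumed to start with a reference count of one held on behalf of the updater (the text implies but does not state the initial value), the class and function names are hypothetical, and a real implementation would use atomic increment and decrement operations.

```python
class Birthstone:
    def __init__(self, name):
        self.name = name
        self.refcount = 1      # assumption: one reference held for the updater
        self.complete = False
        self.freed = False     # stands in for returning memory to the allocator

    def put(self):
        """Drop one reference; whoever drops the last one frees the birthstone."""
        self.refcount -= 1
        if self.refcount == 0:
            self.freed = True

def reader_acquire(bs):
    bs.refcount += 1           # atomic increment in a real system

def reader_release(bs):
    bs.put()                   # atomic decrement-and-test in a real system

def grace_period_callback(bs):
    bs.put()                   # runs only after a grace period has elapsed

# A reader is still waiting on the birthstone when the grace period ends.
bs = Birthstone("N")
reader_acquire(bs)
bs.complete = True
grace_period_callback(bs)
freed_by_callback = bs.freed   # callback found a nonzero count and did not free
reader_release(bs)
freed_by_reader = bs.freed     # the last reader drops the count to zero and frees
```

If no reader holds a reference when the callback runs, the same `put()` path frees the birthstone from the callback instead.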

Lookup Technique for Use With Atomic Move of List Element Between Lists

Any reader 22 ₁, 22 ₂ . . . 22 _(n) performing lookups on list elements that may be concurrently moved between lists during the lookup operation must be adapted to correctly handle this situation. FIGS. 8A-8C illustrate exemplary steps that may be performed. In step 40, the reader 22 ₁, 22 ₂ . . . 22 _(n) performing the lookup identifies the list where the element should be located. If the list represents a hash chain in a hash table, this would require locating the correct hash bucket using the applicable hash algorithm. In step 42, a loop is initiated and several actions are taken for each list element that is encountered by the lookup when traversing the list. In step 44, a count of the number of elements traversed is incremented and a test is made in step 46 whether the count exceeds a computed maximum. If it does, the reader implements a suitable lock to lock out further move operations, then resets the count to zero in step 48, and continues. The reason for doing this is that a sequence of moves that results in an element being returned to the same list it started in could possibly cause the lookup to revisit some elements, potentially looping indefinitely through the list. The count is used so that if a large number of elements are encountered, a lock can be implemented to prevent further move operations on the target element. Note that this check can be dispensed with if move operations occur infrequently. One way to enforce this condition is to require that a grace period elapse between each move operation, although this could unacceptably throttle the rate at which moves occur. Another alternative would be to require that a grace period elapse for a given number of move operations, although this could unnecessarily block move operations if grace periods are blocked by some unrelated operation. A further alternative would be to block move operations if an ongoing grace period extends for too long, although this could also unnecessarily block move operations if grace periods are blocked by some unrelated operation. Despite the foregoing drawbacks, one significant advantage of this limitation on move operations is that it relieves the reader 22 ₁, 22 ₂ . . . 22 _(n) of the burden of the counter check and locking sequence of steps 46 and 48, thus speeding up the lookup, perhaps significantly.

Another way to prevent indefinite looping during lookups is to have the updater 18 ₁, 18 ₂ . . . 18 _(n), when manipulating an element, check to see if the element will end up in the same list that it started in. If so, the updater 18 ₁, 18 ₂ . . . 18 _(n) can insert a birthstone before the element rather than replacing it. This guarantees that a move cannot cause a lookup to visit more entries than it would otherwise have to see. However, for this to work, an element that has been moved cannot be moved again until all in-flight lookups complete. Otherwise, lookups could be recycled by renaming an element back and forth between two lists. As described above, this can be guaranteed by having the updater 18 ₁, 18 ₂ . . . 18 _(n) refuse to move a recently moved element until after a grace period has elapsed since its previous move.

If a count procedure is to be used per steps 46 and 48 of FIG. 8A, the maximum count value may be computed in a number of ways. For example, a count of the number of entries in each list can be maintained in that list's header element. If the lookup traverses more than this number of entries, the reader is very likely traversing elements multiple times due to multiple moves of the element being searched. The reader could also keep a count of the total number of entries in the shared list set 16, and use a function of that count and the number of lists to estimate the maximum list length. Prior art techniques are available for performing this calculation on hash tables.
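One possible shape for the count logic of steps 44 through 48 is sketched below. The estimating function is purely hypothetical: it assumes a bound of a fixed multiple (`slack`) of the average list length, which is only one of many functions of the total entry count and the number of lists that could be chosen. The `moves_locked` flag stands in for acquiring a real lock against move operations.

```python
import math

def estimate_max_count(total_entries, num_lists, slack=4):
    """Hypothetical bound: `slack` times the average list length, at least 1."""
    average = math.ceil(total_entries / num_lists)
    return max(1, slack * average)

class TraversalCounter:
    def __init__(self, max_count):
        self.max_count = max_count
        self.count = 0
        self.moves_locked = False

    def step(self):
        """Count one traversed element (step 44); if the count exceeds the
        maximum (step 46), lock out moves and reset the count (step 48)."""
        self.count += 1
        if self.count > self.max_count:
            self.moves_locked = True
            self.count = 0

counter = TraversalCounter(estimate_max_count(total_entries=1000, num_lists=256))
for _ in range(counter.max_count):
    counter.step()
locked_at_limit = counter.moves_locked   # still within the bound
counter.step()
locked_past_limit = counter.moves_locked # one element too many
```

A larger `slack` makes readers tolerate longer traversals before locking out updaters; a smaller one locks out moves sooner.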

No matter which maximum count computation technique is used, the readers 22 ₁, 22 ₂ . . . 22 _(n) should always be sensitive to excessively locking out move operations. Thus, the count function should be adjusted to choose a desired tradeoff between readers 22 ₁, 22 ₂ . . . 22 _(n) potentially having to traverse large numbers of list elements and updaters 18 ₁, 18 ₂ . . . 18 _(n) performing move operations being needlessly locked out.

In step 50 of FIG. 8A, an optional (read) memory-barrier instruction can be executed on processors with extremely weak memory-consistency models that do not respect data dependencies. In step 52, the current list element's back pointer (not shown) to its list header element is checked to determine if the lookup operation has been pulled off of the list on which the lookup began (as discussed above). If it has, the back pointer will have changed and the reader 22 ₁, 22 ₂ . . . 22 _(n) must go back to the beginning of the original list and retraverse it. If in step 52, the current element's back pointer does not match the original list header element on which the lookup began, the reader needs to return to the original list and retraverse it (without resetting the element count). But first, a test is made in step 54 to determine if there has been a previous retraversal that ended up at the same element with the same non-matching back pointer. If so, the lookup is failed and any move operations that were locked out are allowed to continue. Otherwise, the lookup goes back to the original list and retraverses it beginning in step 42. If there was a back pointer match in step 52, the lookup operation moves to step 56 in FIG. 8B and tests whether the current element is the lookup target. If not, the lookup operation returns to step 42 and proceeds to the next list element in the current list, if one exists. If a name match is found in step 56, a test is made in step 58 whether the current element is a birthstone. If not, the lookup returns success in step 60 and any move operations that were locked out are allowed to continue. If the current element is a birthstone, the lookup operation proceeds to step 62 in which any rename operations that were locked out are allowed to continue and the reader 22 ₁, 22 ₂ . . . 22 _(n) atomically increments the birthstone's reference count. In step 64 the reader 22 ₁, 22 ₂ . . . 22 _(n) waits for the move to complete and then moves to step 66 in which the birthstone's reference count is atomically decremented. As described in more detail below, there are a number of ways that a reader 22 ₁, 22 ₂ . . . 22 _(n) can wait for a move to complete. Although this will cause the lookup to wait, this wait will be no more severe than the delay inherent in traditional algorithms where list updates lock out all lookups. In step 68 of FIG. 8C, the reference count is tested. If it is zero, the birthstone is freed in step 70. In either case, any move operations that were locked out are allowed to continue and the lookup is failed. If the lookup finds no elements matching the lookup target, step 42 will result in no more elements being found and the lookup will fail. Any move operations that were locked out will then be allowed to continue.
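The lookup flow of FIGS. 8A-8C can be approximated by the following single-threaded sketch. It is a simplified model rather than a definitive implementation: the `Elem` class, the `wait_for_move` hook, and the returned "lock was taken" flag are all assumptions introduced for illustration, the list is a None-terminated singly linked chain, reference-count updates are plain integer operations instead of atomic ones, and memory barriers (step 50) are omitted.

```python
class Elem:
    def __init__(self, name, head=None, birthstone=False):
        self.name = name
        self.head = head                  # back pointer to the list header
        self.next = None
        self.birthstone = birthstone
        self.refcount = 1 if birthstone else 0   # assumed initial count
        self.complete = False
        self.freed = False

def build(label, elems):
    header = Elem(label)
    header.head = header
    tail = header
    for e in elems:
        e.head, tail.next, tail = header, e, e
    return header

def lookup(header, target, max_count, wait_for_move=lambda bs: None):
    """Return (element or None, whether the move lock was ever taken)."""
    count, locked, seen_mismatch = 0, False, None
    cur = header.next                     # steps 40 and 42
    while cur is not None:
        count += 1                        # step 44
        if count > max_count:             # step 46
            locked, count = True, 0       # step 48
        if cur.head is not header:        # step 52: pulled to another list
            key = (cur, cur.head)
            if seen_mismatch == key:      # step 54: same mismatch seen before
                return None, locked
            seen_mismatch = key
            cur = header.next             # retraverse the original list
            continue
        if cur.name != target:            # step 56
            cur = cur.next
            continue
        if not cur.birthstone:            # step 58
            return cur, locked            # step 60
        cur.refcount += 1                 # step 62
        wait_for_move(cur)                # step 64
        cur.refcount -= 1                 # step 66
        if cur.refcount == 0:             # step 68
            cur.freed = True              # step 70
        return None, locked               # fail so the caller can retry
    return None, locked

stone = Elem("N", birthstone=True)
l2 = build("L2", [Elem("M"), stone])

def finish_move(bs):
    """Simulated updater: link the real N, unlink and complete the birthstone."""
    real = Elem("N", head=l2)
    real.next = bs.next
    l2.next.next = real                   # M now points at the real N
    bs.complete = True
    bs.refcount -= 1                      # models the grace-period callback

first, _ = lookup(l2, "N", max_count=8, wait_for_move=finish_move)
second, _ = lookup(l2, "N", max_count=8)
```

In this run the first lookup meets the birthstone, waits for the simulated move, frees the birthstone as the last reference holder, and fails; the retry then finds the real element N.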

As indicated, there are a number of ways a reader 22 ₁, 22 ₂ . . . 22 _(n) can wait for a move operation to complete. Each has different advantages in different situations. For example, one technique would be to have the lookups block (sleep) on a global semaphore (sometimes referred to as a “sleeplock”). This minimizes memory use, because there is only one semaphore. However, it can result in needless wakeups when there are multiple moves executing in parallel. It also causes the system to incur the overhead of an additional context switch each time a lookup encounters a birthstone. Another alternative would be to have lookups block on a per-element semaphore. This requires additional memory, but eliminates the needless wakeups. It still incurs the context-switch overhead. A further alternative would be to have the lookups spin (busy wait) on a global lock. This, although possible, is almost always undesirable due to the potential lock contention overhead. A still further alternative would be to have lookups spin on a per-element lock. This likely requires no additional memory, and this method is preferred as long as the move operation does not block and is likely to complete quickly.
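As one illustration of the per-element blocking alternative, the sketch below has a reader thread sleep on a wait object stored in the birthstone itself. Python's `threading.Event` is used here only as a stand-in for a per-element semaphore; the class and function names are hypothetical.

```python
import threading

class Birthstone:
    def __init__(self):
        self.done = threading.Event()   # per-element wait object

def reader(bs, results):
    bs.done.wait()                      # sleep until the move is marked complete
    results.append("retry")             # the lookup then fails and is retried

bs = Birthstone()
results = []
t = threading.Thread(target=reader, args=(bs, results))
t.start()
bs.done.set()                           # updater marks the birthstone complete
t.join(timeout=5)
```

Because each birthstone carries its own wait object, completing one move wakes only the readers waiting on that move, at the cost of the extra per-element storage noted above.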

Atomic Rename( ) Using Read-Copy Update

The atomic POSIX rename( ) problem described by way of background above can be solved using the above-described technique to insert a “birthstone” at the destination of a file system directory entry to be renamed. As per the discussion above, when a lookup operation encounters a birthstone that matches the lookup target, it blocks until the birthstone is marked “completed”, and then fails (and is presumably then retried). This ensures that any operation that fails to see the old file name will see the new file name on any subsequent lookup, and also ensures that any operation that sees the new file name will fail to see the old file name, as required by POSIX semantics.

As shown in FIG. 9, an exemplary birthstone 80 for use in the POSIX rename( ) operation may be constructed with the same fields used in other directory entry elements, but with the addition of a refcount field and a flags field containing a flag to indicate when the birthstone is “complete,” as follows:

-   refcount: a reference counter
-   flags: status flags
-   parent: a pointer to the parent directory
-   hash: a list of entries in the same hash chain (in systems using a global hash table for pathname-component translation)
-   hashchain: a pointer to the head of the hash chain (in systems using a global hash table for pathname-component translation)
-   child: a list of siblings that are children of the same directory
-   subdirs: list of children of this directory
-   name: a pointer to a structure containing the pointer to the name, its length, and a hash value (in systems using a global hash table).
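The field list above might be modeled as follows. This is a sketch only: the flag bit values, the `NameInfo` helper, the toy hash function, and the initial reference count of one are assumptions, and the list-valued fields are represented as plain Python lists rather than linked-list heads.

```python
from dataclasses import dataclass, field
from typing import List, Optional

FLAG_BIRTHSTONE = 0x1   # hypothetical bit values
FLAG_COMPLETE = 0x2

@dataclass
class NameInfo:
    text: str
    length: int
    hash_value: int

@dataclass(eq=False)
class DirEntry:
    name: NameInfo
    parent: Optional["DirEntry"] = None
    hashchain: Optional[object] = None          # head of the hash chain
    refcount: int = 0
    flags: int = 0
    hash: List["DirEntry"] = field(default_factory=list)
    child: List["DirEntry"] = field(default_factory=list)
    subdirs: List["DirEntry"] = field(default_factory=list)

def toy_hash(text):
    """Placeholder hash; a real system would use its own hash algorithm."""
    return sum(text.encode()) % 1024

def make_birthstone(new_name, new_parent, chain):
    """Create a placeholder carrying the identity the element will have
    after the rename, flagged as a birthstone and not yet complete."""
    info = NameInfo(new_name, len(new_name), toy_hash(new_name))
    return DirEntry(info, new_parent, chain, refcount=1, flags=FLAG_BIRTHSTONE)

def is_birthstone(entry):
    return bool(entry.flags & FLAG_BIRTHSTONE)

def is_complete(entry):
    return bool(entry.flags & FLAG_COMPLETE)

bs = make_birthstone("new.txt", None, None)
```

Marking the move finished then amounts to setting `FLAG_COMPLETE` in `bs.flags`.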

The rename( ) operation is implemented according to the generalized atomic move procedure described above, with the target list element being a file system directory entry and the list being a doubly-linked circular directory list in a directory entry cache or a hash chain in a directory entry hash table. As is known, the rename( ) operation can be used to change a file's name without moving it to a different directory, or it can move the file to a different directory without changing its name, or it can both rename the file and move it to a different directory. Because of the way lists are implemented in a directory entry cache or hash table, the rename( ) operation usually results in a directory entry element being moved from one list to another. Of course, if the rename( ) operation results in the directory entry element remaining on the same list, the birthstone procedure described herein may not be necessary. If the element name can be renamed atomically, then such an operation can be used. Lookups will see either the old name or the new name.

Assuming that a birthstone is required, it will typically have the same name, hash, hash chain, child, and parent as the element will have after being renamed, but with the requisite indication (e.g., a flag value, bit or other parameter) in the “flags” field of FIG. 9 that this is a birthstone.

When using the generalized atomic move operation of FIG. 7 to perform a rename( ) operation using birthstones, the re-identification step 34 requires special consideration when there is a global hash table (see below). Special handling is also required if short names are kept within the list element itself (see below).

Special Handling for Global Hash Table

As mentioned above, the re-identification step 34 of FIG. 7 will require an update to a directory entry element's name, but not its parent directory pointer, if the corresponding file is being renamed but not moved to a new directory. Similarly, the re-identification step 34 will require an update to a directory entry element's parent directory pointer, but not its name, if the corresponding file is keeping its old name and is just being moved to a different directory. However, if the corresponding file is being renamed and moved to a new directory, the re-identification step 34 of FIG. 7 will result in the directory entry element's name and parent directory pointer both being changed.

For best results in this situation when there is a global hash table, the name and parent pointer of each directory entry element should be changed atomically. Although it is possible to change them one at a time, doing so makes lookups considerably more complex. This atomic update may be accomplished by placing the parent directory pointer into a structure that also contains the element's name, so that the “name” field of the element points to both of them (and so that the “parent” field is not required).
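The idea of reaching both the name and the parent through a single pointer can be sketched as below. The `Identity` and `Entry` names are hypothetical. The sketch relies on a reader taking one load of the reference and then using only that snapshot; in a lower-level language the publishing store would additionally need the appropriate memory ordering.

```python
from typing import NamedTuple

class Identity(NamedTuple):
    """Immutable pair reached through the element's single "name" pointer."""
    name: str
    parent: str

class Entry:
    def __init__(self, name, parent):
        self.ident = Identity(name, parent)

    def rename(self, new_name, new_parent):
        # Build the new pair off to the side, then publish it with one store,
        # so a reader never observes a new name with an old parent or vice versa.
        self.ident = Identity(new_name, new_parent)

    def snapshot(self):
        ident = self.ident          # one load; both fields come from one version
        return ident.name, ident.parent

e = Entry("a.txt", "/old")
old = e.ident                       # a reader still holding the previous pair
e.rename("b.txt", "/new")
```

A reader that loaded `old` before the rename continues to see a consistent old pair, which is why the old structure can only be freed after a grace period.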

However, if the parent pointer is frequently referenced, this might have unacceptable performance consequences. In this case, it may be better to keep the “name” and “parent” fields, and also provide a special pointer in each element that is normally NULL, but, when non-NULL, points to a special structure containing both the parent pointer and the name. In some cases, this structure can simply be another list element.

It will be appreciated that the special-pointer approach should provide some way of propagating the atomic update to the special name/parent pointer structure back to the parent and name stored in the main element containing the pointer. This can be accomplished by registering a callback to invoke a function (after a grace period elapses) that does the following:

-   a) Copy the new name and the new parent pointer from the separate storage back into the list element. This may be done safely because the grace period has elapsed, ensuring that no processes are still referencing the old name and parent pointer.
-   b) On processor architectures with weak memory consistency, execute a (write) memory-barrier instruction.
-   c) Copy the special pointer to a temporary variable.
-   d) Set the special pointer to NULL.
-   e) Register a callback to free up the special structure that contained the new name and parent pointer (which is now pointed to by the temporary variable) after a grace period elapses.

Special Handling for Short Names in Elements

If a directory entry's name is stored directly in the list element (and not as a pointer), it cannot be changed atomically. Therefore, the new name must be placed in separate storage as a long name would be, even if the new name is short enough to fit into the list element. This allows the name to be changed atomically. When it is desirable to move the short name back into the list element, a callback can be registered to invoke a function (after a grace period elapses) that does the following:

-   a) Copy the new name from the separate storage back into the list element. This may be done safely because the grace period has elapsed, ensuring that no processes are still referencing the old name.
-   b) On processor architectures with weak memory consistency, execute a (write) memory-barrier instruction.
-   c) Copy the “name” pointer to a temporary variable.
-   d) Point the “name” pointer to the internal storage.
-   e) Register a callback to free up the old name (pointed to by the temporary variable) after a grace period elapses.

Lookup of Directory Entry Element While Taking Into Account Concurrent Rename( )

In cases where a directory cache or hash table is subject to the above-described rename( ) operation, the lookup procedure described above in connection with FIGS. 8A-8C may be used for performing concurrent file name lookups. The only refinement occurs at step 52 of FIG. 8A depending on whether the directory entry being renamed is a list element in a directory list or a list element in a hash table chain. In the former case, the list on which the element is located will be the usual linked list of directory entries that extends from a parent directory entry through all of the entries representing subdirectories or files of the parent. Step 52 of FIG. 8A will then check the element's “parent” pointer against the parent directory entry at the head of the list where the lookup started. If the list on which the element is located is a hash chain in a hash table, step 52 of FIG. 8A will check the element's “hashchain” pointer against the initial hash bucket list element where the lookup started. In all other respects, the lookup operation of FIGS. 8A-8C remains the same.

Accordingly, a technique has been disclosed for atomically moving list elements from one list to another using read-copy update. It will be appreciated that the foregoing concepts may be variously embodied in any of a data processing system, a machine implemented method, and a computer program product in which programming means are recorded on one or more data storage media for use in controlling a data processing system to perform the required functions. Exemplary data storage media for storing such programming means are shown by reference numeral 100 in FIG. 10. The media 100 are shown as being portable optical storage disks of the type that are conventionally used for commercial software sales. Such media can store the programming means of the invention either alone or in conjunction with an operating system or other software product that incorporates read-copy update functionality. The programming means could also be stored on portable magnetic media (such as floppy disks, flash memory sticks, etc.) or on magnetic media combined with drive systems (e.g. disk drives) incorporated in computer platforms.

While various embodiments of the invention have been described, it should be apparent that many variations and alternative embodiments could be implemented in accordance with the invention. It is understood, therefore, that the invention is not to be in any way limited except in accordance with the spirit of the appended claims and their equivalents.

1. A method for atomically moving a shared list element from a first list location to a second list location, comprising: inserting a placeholder element at said second list location for use by readers to monitor moving of said shared list element; removing said shared list element from said first list location; re-identifying said shared list element to reflect its move from said first list location to said second list location; inserting said shared list element at said second list location and unlinking said placeholder element; performing deferred removal of said placeholder element following a period in which readers maintain no references to said placeholder element.
2. A method in accordance with claim 1, wherein said placeholder element includes an indication of whether said move operation has completed.
3. A method in accordance with claim 1, wherein said placeholder element includes a reference count representing a count of readers maintaining references to said placeholder element, and wherein said period is determined by said reference count and by readers of said placeholder element passing through a quiescent state following said unlinking.
4. A method in accordance with claim 1, wherein said shared list element is a directory entry element in a file system directory list and said move operation is part of a file rename( ) operation.
5. A method in accordance with claim 1, wherein said shared list element is a directory entry element in a file system hash table chain and said move operation is part of a file rename( ) operation.
6. A method for performing a lookup of a target list element that is subject to being atomically moved from a first list to a second list, comprising: initiating a list traversal beginning at a first list element; and upon encountering a list element that is a placeholder for said target list element that was generated as a result of a concurrent move operation involving said target list element, waiting until said placeholder indicates that said move operation has completed, and thereafter returning failure so that said lookup can be retried.
7. A method in accordance with claim 6, further including maintaining a count of elements traversed by said lookup and asserting a lock against concurrent move operations if said count reaches a computed maximum.
8. A method in accordance with claim 6, further including determining whether said lookup has been pulled from one list to another as a result of a concurrent move, and if true, returning to the initial list being traversed.
9. A method in accordance with claim 6, further including incrementing a reference count in said placeholder upon encountering said placeholder and decrementing said reference count if said placeholder indicates that said concurrent move operation has completed.
10. A method in accordance with claim 6, wherein said waiting on said placeholder comprises one of blocking on a global semaphore, blocking on a per-element semaphore, spinning on a global lock, and spinning on a per-element lock.
11. A data processing system having plural data processors, said system being adapted to atomically move a shared list element from a first list location to a second list location, comprising: means for inserting a placeholder element at said second list location for use by readers to monitor moving of said shared list element; means for removing said shared list element from said first list location; means for re-identifying said shared list element to reflect its move from said first list location to said second list location; means for inserting said shared list element at said second list location and unlinking said placeholder element; means for performing deferred removal of said placeholder element following a period in which readers maintain no references to said placeholder element.
12. A system in accordance with claim 11, wherein said placeholder element includes an indication of whether said move operation has completed.
13. A system in accordance with claim 11, wherein said placeholder element includes a reference count representing a count of readers maintaining references to said placeholder element, and wherein said period is determined by said reference count and by readers of said placeholder element passing through a quiescent state following said unlinking.
14. A system in accordance with claim 11, wherein said shared list element is a directory entry element in a file system directory list and said move operation is part of a file rename( ) operation.
15. A system in accordance with claim 11, wherein said shared list element is a directory entry element in a file system hash table chain and said move operation is part of a file rename( ) operation.
16. A data processing system adapted to perform a lookup of a target list element that is subject to being atomically moved from a first list to a second list, comprising: means for initiating a list traversal beginning at a first list element; and means responsive to encountering a list element that is a placeholder for said target list element that was generated as a result of a concurrent move operation involving said target list element for waiting until said placeholder indicates that said move operation has completed, and thereafter returning failure so that said lookup can be retried.
17. A system in accordance with claim 16, further including means for maintaining a count of elements traversed by said lookup and for asserting a lock against concurrent move operations if said count reaches a computed maximum.
18. A system in accordance with claim 16, further including means for determining whether said lookup has been pulled from one list to another as a result of a concurrent move, and if true, returning to the initial list being traversed.
19. A system in accordance with claim 16, further including means for incrementing a reference count in said placeholder upon encountering said placeholder and decrementing said reference count if said placeholder indicates that said concurrent move operation has completed.
20. A system in accordance with claim 16, wherein said means for waiting on said placeholder comprises one of means for blocking on a global semaphore, means for blocking on a per-element semaphore, means for spinning on a global lock, and means for spinning on a per-element lock.
21. A computer program product for atomically moving a shared list element from a first list location to a second list location, comprising: one or more data storage media; means recorded on said data storage media for programming a data processing platform to operate as by: inserting a placeholder element at said second list location for use by readers to monitor moving of said shared list element; removing said shared list element from said first list location; re-identifying said shared list element to reflect its move from said first list location to said second list location; inserting said shared list element at said second list location and unlinking said placeholder element; performing deferred removal of said placeholder element following a period in which readers maintain no references to said placeholder element.
22. A computer program product in accordance with claim 21, wherein said placeholder element includes an indication of whether said move operation has completed.
23. A computer program product in accordance with claim 21, wherein said placeholder element includes a reference count representing a count of readers maintaining references to said placeholder element, and wherein said grace period is determined by said reference count and by readers of said placeholder element passing through a quiescent state following said unlinking.
24. A computer program product in accordance with claim 21, wherein said shared list element is a directory entry element in a file system directory list and said move operation is part of a file rename( ) operation.
25. A computer program product in accordance with claim 21, wherein said shared list element is a directory entry element in a file system hash table chain and said move operation is part of a file rename( ) operation.
26. A computer program product for performing a lookup of a target list element that is subject to being atomically moved from a first list to a second list, comprising: one or more data storage media; means recorded on said data storage media for programming a data processing platform to operate as by: initiating a list traversal beginning at a first list element; and upon encountering a list element that is a placeholder for said target list element that was generated as a result of a concurrent move operation involving said target list element, waiting until said placeholder indicates that said move operation has completed, and thereafter returning failure so that said lookup can be retried.
27. A computer program product in accordance with claim 26, further including maintaining a count of elements traversed by said lookup and asserting a lock against concurrent move operations if said count reaches a computed maximum.
28. A computer program product in accordance with claim 26, further including determining whether said lookup has been pulled from one list to another as a result of a concurrent move, and if true, returning to the initial list being traversed.
29. A computer program product in accordance with claim 26, further including incrementing a reference count in said placeholder upon encountering said placeholder and decrementing said reference count if said placeholder indicates that said concurrent move operation has completed.
30. A computer program product in accordance with claim 26, wherein said waiting on said placeholder comprises one of blocking on a global semaphore, blocking on a per-element semaphore, spinning on a global lock, and spinning on a per-element lock.